Abstract

Multi-task feature learning (MTFL) is a powerful technique for boosting predictive performance
by learning multiple related classification/regression/clustering tasks simultaneously.
However, solving the MTFL problem remains challenging when the feature dimension is extremely
large. In this paper, we propose a novel screening rule, based on the dual projection
onto convex sets (DPC), to quickly identify the inactive features, i.e., the features that have
zero coefficients in the solution vectors across all tasks. One appealing property of DPC is that it is
safe: the detected inactive features are guaranteed to have zero coefficients in
the solution vectors across all tasks. Thus, by removing the inactive features from the training
phase, we may obtain substantial savings in computational cost and memory usage without
sacrificing accuracy. To the best of our knowledge, it is the first screening rule that is applicable
to sparse models with multiple data matrices. A key challenge in deriving DPC is to solve a
nonconvex problem. We show that we can solve for the global optimum efficiently via a properly
chosen parametrization of the constraint set. Moreover, DPC has very low computational cost
and can be integrated with any existing solver. We have evaluated the proposed DPC rule on
both synthetic and real data sets. The experiments indicate that DPC is very effective in
identifying the inactive features, especially for high-dimensional data, which leads to a speedup
of up to several orders of magnitude.


1 Introduction 

Empirical studies have shown that learning multiple related tasks simultaneously (multi-task
learning, MTL) often provides superior predictive performance relative to learning each task
independently (Ando and Zhang, 2005, Argyriou et al., 2008, Bakker and Heskes, 2003, Evgeniou
et al., 2005, Zhang et al., 2006, Chen et al., 2013). This observation also has solid theoretical
foundations (Ando and Zhang, 2005, Baxter, 2000, Ben-David and Schuller, 2003, Caruana, 1997),
especially when the training sample size is small for each task. One popular MTL method,
especially for high-dimensional data, is multi-task feature learning (MTFL), which uses the group
Lasso penalty to ensure that all tasks select a common set of features (Argyriou et al., 2007).
MTFL has found great success in many real-world applications including but not limited to: breast
cancer classification (Zhang et al., 2010), disease progression prediction (Zhou et al., 2012),
gene data analysis (Kim and Xing, 2009), and neural semantic basis discovery (Liu et al., 2009a).
A major issue in MTFL of great practical importance is to develop efficient solvers (Liu et al.,
2009b, Sra, 2012, Wang et al., 2013a). However, it remains challenging to apply the MTFL models
to large-scale problems.

The idea of screening has been shown to be very effective in scaling the data and improving the
efficiency of many popular sparse models, e.g., Lasso (El Ghaoui et al., 2012, Wang et al., 2013b,
Wang et al., Xiang et al., 2011, Tibshirani et al., 2012), nonnegative Lasso (Wang and Ye, 2014),
group Lasso (Wang et al., 2013b, Wang et al., Tibshirani et al., 2012), mixed-norm regression
(Wang et al., 2013a), $\ell_1$-regularized logistic regression (Wang et al., 2014b), sparse-group
Lasso (Wang and Ye, 2014), support vector machine (SVM) (Ogawa et al., 2013, Wang et al., 2014a),
and least absolute deviations (LAD) (Wang et al., 2014a). Essentially, screening aims to quickly
identify the zero components in the solution vectors such that the corresponding features, called
inactive features (e.g., in Lasso), or data samples, called non-support vectors (e.g., in SVM),
can be removed from the optimization. Therefore, the size of the data matrix and the number of
variables to be computed can be significantly reduced, which may lead to substantial savings in
computational cost and memory usage without sacrificing accuracy. Compared to the solvers without
screening, the speedup gained by the screening methods can be several orders of magnitude.

However, we note that all the existing screening methods are only applicable to sparse models 
with a single data matrix. Therefore, motivated by the challenges posed by large-scale data and the 
promising performance of existing screening methods, we propose a novel framework for developing 
effective and efficient screening rules for a popular MTFL model via the dual projection onto convex 
sets (DPC). The framework of DPC extends the state-of-the-art screening rule, called EDPP (Wang 
et al.), for the standard Lasso problem (Tibshirani, 1996)—that assumes a single data matrix—to 
a popular MTFL model—that involves multiple data matrices across different tasks. To the best 
of our knowledge, DPC is the first screening rule that is applicable to sparse models with multiple 
data matrices. 

The DPC screening rule detects the inactive features by maximizing a convex function over 
a convex set containing the dual optimal solution, which is a nonconvex problem. To find the 
region containing the dual optimal solution, we show that the corresponding dual problem can 
be formulated as a projection problem—which admits many desirable geometric properties—by 
utilizing the bilinearity of the inner product. Then, by a carefully chosen parameterization of the 
constraint set, we transform the nonconvex problem to a quadratic programming problem over one 
quadratic constraint (QP1QC) (Gay, 1981), which can be solved for the global optimum efficiently. 
Experiments on both synthetic and real data sets indicate that the speedup gained by DPC can be 
orders of magnitude. Moreover, DPC shows better performance as the feature dimension increases, 
which makes it a very competitive candidate for the applications of very high-dimensional data. 

We organize the rest of this paper as follows. In Section 2, we briefly review some basics of 
a popular MTFL model. Then, we derive the dual problem in Section 3. Based on an in-depth
analysis of the geometric properties of the dual problem and the dual feasible set, we present the 
proposed DPC screening rule in Section 4. In Section 5, we evaluate the DPC rule on both synthetic 
and real data sets. We conclude this paper in Section 6. Please refer to the supplement for proofs 
not included in the main text. 

Notation: Denote the $\ell_2$ norm by $\|\cdot\|$. For $x \in \mathbb{R}^n$, let its $i$-th
component be $x_i$, and the diagonal matrix with the entries of $x$ on the main diagonal be
$\mathrm{diag}(x)$. For a set of positive integers $\{N_t : t = 1, \ldots, T, \sum_{t=1}^{T} N_t = N\}$,
we denote the $t$-th subvector of $x \in \mathbb{R}^N$ by $x_t$ such that
$x = (x_1^T, \ldots, x_T^T)^T$, where $x_t \in \mathbb{R}^{N_t}$ for $t = 1, \ldots, T$. For vectors
$x, y \in \mathbb{R}^n$, we use $\langle x, y \rangle$ and $x^T y$ interchangeably to denote the
inner product. For a matrix $M \in \mathbb{R}^{m \times n}$, let $m^i$, $m_j$, and $m_{ij}$ be its
$i$-th row, $j$-th column and $(i,j)$-th entry, respectively. We define the $(2,1)$-norm of $M$ by
$\|M\|_{2,1} = \sum_{i=1}^{m} \|m^i\|$. For two matrices $A, B \in \mathbb{R}^{m \times n}$, we
define their inner product by $\langle A, B \rangle = \mathrm{tr}(A^T B)$. Let $I$ be the identity
matrix. For a convex function $f(\cdot)$, let $\partial f(\cdot)$ be its subdifferential. For a
vector $x$ and a convex set $C$, the projection operator is:

\[ P_C(x) := \operatorname*{argmin}_{y \in C} \tfrac{1}{2}\|y - x\|. \]


2 Basics 


In this section, we briefly review some basics of a popular MTFL model and mention several 
equivalent formulations. 

Suppose that we have $T$ learning tasks $\{(X_t, y_t) : t = 1, \ldots, T\}$, where
$X_t \in \mathbb{R}^{N_t \times d}$ is the data matrix of the $t$-th task with $N_t$ samples and
$d$ features, and $y_t \in \mathbb{R}^{N_t}$ is the corresponding response vector. A widely used
MTFL model (Argyriou et al., 2007) takes the form of

\[ \min_{W \in \mathbb{R}^{d \times T}} \sum_{t=1}^{T} \frac{1}{2}\|y_t - X_t w_t\|^2 + \lambda \|W\|_{2,1}, \tag{1} \]

where $w_t \in \mathbb{R}^d$ is the weight vector of the $t$-th task and $W = (w_1, \ldots, w_T)$.
Because the $\|\cdot\|_{2,1}$-norm induces sparsity on the rows of $W$, the weight vectors across
all tasks share the same sparse pattern. We note that the model in (1) is equivalent to several
other popular MTFL models.
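As a concrete reading of model (1), the following minimal Python sketch evaluates its objective for a given weight matrix. The function names `l21_norm` and `mtfl_objective` and the nested-list data layout are assumptions made for illustration only; they do not come from the paper or from any solver.

```python
import math

def l21_norm(W):
    """(2,1)-norm: sum of the Euclidean norms of the rows of W (d rows, T columns)."""
    return sum(math.sqrt(sum(w * w for w in row)) for row in W)

def mtfl_objective(X, y, W, lam):
    """Objective of model (1).

    X[t] is the N_t x d data matrix of task t (a list of rows), y[t] is its
    response vector, and W[j][t] is the weight of feature j in task t.
    """
    d = len(W)
    loss = 0.0
    for t, (X_t, y_t) in enumerate(zip(X, y)):
        for row, target in zip(X_t, y_t):
            pred = sum(row[j] * W[j][t] for j in range(d))
            loss += 0.5 * (target - pred) ** 2
    return loss + lam * l21_norm(W)

# Toy data: two tasks, two features. The second row of W is zero, so the
# second feature is inactive in both tasks.
X = [[[1.0, 0.0]], [[0.0, 1.0]]]
y = [[1.0], [2.0]]
W = [[3.0, 4.0], [0.0, 0.0]]
print(l21_norm(W), mtfl_objective(X, y, W, lam=1.0))
```

Because the penalty sums row norms, a feature contributes nothing to it exactly when its whole row of $W$ is zero, which is the shared sparsity pattern described above.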

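The first example introduces a positive weight parameter $\rho_t$ for $t = 1, \ldots, T$ to each
term in the loss function:

\[ \min_{W \in \mathbb{R}^{d \times T}} \sum_{t=1}^{T} \frac{\rho_t}{2}\|y_t - X_t w_t\|^2 + \lambda \|W\|_{2,1}, \]

which reduces to (1) by setting $\tilde{y}_t = \sqrt{\rho_t}\, y_t$ and $\tilde{X}_t = \sqrt{\rho_t}\, X_t$.

The second example introduces another regularizer to (1):

\[ \min_{W \in \mathbb{R}^{d \times T}} \sum_{t=1}^{T} \frac{1}{2}\|y_t - X_t w_t\|^2 + \lambda \|W\|_{2,1} + \rho \|W\|_F^2, \]

where $\rho$ is a positive parameter and $\|\cdot\|_F$ is the Frobenius norm. Let
$I \in \mathbb{R}^{d \times d}$ be the identity matrix and $0$ be the $d$-dimensional vector with
all zero entries. By letting

\[ \tilde{X}_t = (X_t^T, \sqrt{2\rho}\, I)^T, \quad \tilde{y}_t = (y_t^T, 0^T)^T, \quad t = 1, \ldots, T, \]

we can also simplify the above MTFL model to (1).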

In this paper, we focus on developing the DPC screening rule for the MTFL model in (1). 


3 The Dual Problem 

In this section, we show that we can formulate the dual problem of the MTFL model in (1) as a 
projection problem by utilizing the bilinearity of the inner product. 

We first introduce a new set of variables:

\[ z_t = y_t - X_t w_t, \quad t = 1, \ldots, T. \tag{2} \]

Then, the MTFL model in (1) can be written as

\[ \min_{W, z} \sum_{t=1}^{T} \frac{1}{2}\|z_t\|^2 + \lambda \|W\|_{2,1}, \quad \text{s.t. } z_t = y_t - X_t w_t, \ t = 1, \ldots, T. \tag{3} \]

Let $\lambda\theta \in \mathbb{R}^N$ be the vector of Lagrangian multipliers. Then, the Lagrangian
of (1) is

\[ L(W, z; \theta) = \sum_{t=1}^{T} \frac{1}{2}\|z_t\|^2 + \lambda \|W\|_{2,1} + \lambda \sum_{t=1}^{T} \langle \theta_t, y_t - X_t w_t - z_t \rangle. \tag{4} \]

To get the dual problem, we need to minimize $L(W, z; \theta)$ over $W$ and $z$. We can see that

\[ 0 = \nabla_z L(W, z; \theta) \ \Rightarrow \ \operatorname*{argmin}_z L(W, z; \theta) = \lambda\theta. \tag{5} \]

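For notational convenience, let

\[ f(W) = \lambda \|W\|_{2,1} - \lambda \sum_{t=1}^{T} \langle \theta_t, X_t w_t \rangle. \]

Thus, to minimize $L(W, z; \theta)$ with respect to $W$, it is equivalent to minimize $f(W)$, i.e.,

\[ \{W : 0 \in \partial_W L(W, z; \theta)\} = \{W : 0 \in \partial f(W)\}. \]

By the bilinearity of the inner product, we can decouple $f(W)$ into a set of independent
subproblems. Indeed, we can rewrite the second term of $f(W)$ as

\[ \sum_{t=1}^{T} \langle \theta_t, X_t w_t \rangle = \sum_{t=1}^{T} \langle X_t^T \theta_t, w_t \rangle = \langle M, W \rangle, \tag{6} \]

where $M = (X_1^T \theta_1, \ldots, X_T^T \theta_T)$. Eq. (6) expresses $\langle M, W \rangle$ by
the sum of the inner products of the corresponding columns. By the bilinearity of the inner
product, we can also express $\langle M, W \rangle$ by the sum of the inner products of the
corresponding rows:

\[ \sum_{t=1}^{T} \langle \theta_t, X_t w_t \rangle = \langle M, W \rangle = \sum_{\ell=1}^{d} \langle m^\ell, w^\ell \rangle. \tag{7} \]

Denote the $j$-th column of $X_t$ by $x_j^{(t)}$. We can see that

\[ m^\ell = \left( \langle x_\ell^{(1)}, \theta_1 \rangle, \langle x_\ell^{(2)}, \theta_2 \rangle, \ldots, \langle x_\ell^{(T)}, \theta_T \rangle \right). \tag{8} \]

Moreover, as $\|W\|_{2,1} = \sum_{\ell=1}^{d} \|w^\ell\|$, Eq. (7) implies that:

\[ f(W) = \lambda \sum_{\ell=1}^{d} f^\ell(w^\ell), \]

where $f^\ell(w^\ell) = \|w^\ell\| - \langle m^\ell, w^\ell \rangle$. Thus, to minimize $f(W)$, we
can minimize each $f^\ell(w^\ell)$ separately. The subdifferential counterpart of Fermat's rule
(Bauschke and Combettes, 2011), i.e., $0 \in \partial f^\ell((w^\ell)^*)$, yields:

\[ m^\ell \in \begin{cases} \left\{ \dfrac{(w^\ell)^*}{\|(w^\ell)^*\|} \right\}, & \text{if } (w^\ell)^* \neq 0, \\[2mm] \{u \in \mathbb{R}^T : \|u\| \le 1\}, & \text{if } (w^\ell)^* = 0, \end{cases} \tag{9} \]

where $(w^\ell)^*$ is the minimizer of $f^\ell(\cdot)$.

We note that Eq. (9) implies $\|m^\ell\| \le 1$. If this is not the case, then $f^\ell(\cdot)$ is
not lower bounded (see the supplement for discussions), i.e.,
$\min_{w^\ell} f^\ell(w^\ell) = -\infty$. Thus, by Eqs. (5) and (9), the dual function is

\[ q(\theta) = \min_{W, z} L(W, z; \theta) = \begin{cases} -\dfrac{\lambda^2}{2}\|\theta\|^2 + \lambda \langle \theta, y \rangle, & \text{if } \|m^\ell\| \le 1, \ \forall\, \ell \in \{1, \ldots, d\}, \\[2mm] -\infty, & \text{otherwise}. \end{cases} \tag{10} \]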

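Maximizing $q(\theta)$ yields the dual problem of (1) as follows:

\[ \max_{\theta} \ \frac{1}{2}\|y\|^2 - \frac{\lambda^2}{2}\left\| \frac{y}{\lambda} - \theta \right\|^2, \quad \text{s.t. } \sum_{t=1}^{T} \langle x_\ell^{(t)}, \theta_t \rangle^2 \le 1, \ \ell = 1, \ldots, d. \tag{11} \]

It is evident that the problem in (11) is equivalent to

\[ \min_{\theta} \ \frac{1}{2}\left\| \frac{y}{\lambda} - \theta \right\|^2, \quad \text{s.t. } \sum_{t=1}^{T} \langle x_\ell^{(t)}, \theta_t \rangle^2 \le 1, \ \ell = 1, \ldots, d. \tag{12} \]

In view of (12), it is indeed a projection problem. Let $\mathcal{F}$ be the feasible set of (12).
Then, the optimal solution of (12), denoted by $\theta^*(\lambda)$, is the projection of
$y/\lambda$ onto $\mathcal{F}$, namely,

\[ \theta^*(\lambda) = P_{\mathcal{F}}\left( \frac{y}{\lambda} \right). \tag{13} \]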


4 The DPC Rule 

In this section, we present the proposed DPC screening rule for the MTFL model in (1). Inspired
by the Karush-Kuhn-Tucker (KKT) conditions (Güler, 2010), in Section 4.1, we first present the
general guidelines. The main challenges are twofold: 1) we need to estimate the dual
optimal solution as accurately as possible; 2) we need to solve a nonconvex optimization problem.
In Section 4.2, we give an accurate estimation of the dual optimal solution based on the geometric
properties of the projection operators. Then, in Section 4.3, we show that we can efficiently solve
for the global optimum to the nonconvex problem. We present the DPC rule for the MTFL model
(1) in Section 4.4.


4.1 Guidelines for Developing DPC 

We present the general guidelines to develop screening rules for the MTFL model (1) via the KKT 
conditions. 

Let $W^*(\lambda) = (w_1^*(\lambda), \ldots, w_T^*(\lambda))$ be the optimal solution of (1). By
Eqs. (2), (5) and (9), the KKT conditions are:

\[ y_t = X_t w_t^*(\lambda) + \lambda \theta_t^*(\lambda), \quad t = 1, \ldots, T, \tag{14} \]

\[ g_\ell(\theta^*(\lambda)) \in \begin{cases} \{1\}, & \text{if } (w^\ell)^*(\lambda) \neq 0, \\ [-1, 1], & \text{if } (w^\ell)^*(\lambda) = 0, \end{cases} \quad \ell = 1, \ldots, d, \tag{15} \]

where $(w^\ell)^*(\lambda)$ is the $\ell$-th row of $W^*(\lambda)$, and

\[ g_\ell(\theta) = \sum_{t=1}^{T} \langle x_\ell^{(t)}, \theta_t \rangle^2, \quad \ell = 1, \ldots, d. \tag{16} \]

For $\ell = 1, \ldots, d$, Eq. (15) yields

\[ g_\ell(\theta^*(\lambda)) < 1 \ \Rightarrow \ (w^\ell)^*(\lambda) = 0. \tag{R} \]

The rule in (R) provides a method to identify the rows in $W^*(\lambda)$ that have only zero
entries. However, (R) is not applicable to real applications, as it assumes knowledge of
$\theta^*(\lambda)$, and solving the dual problem (12) could be as expensive as solving the primal
problem (1). Inspired by SAFE (El Ghaoui et al., 2012), we can first estimate a set $\Theta$ that
contains $\theta^*(\lambda)$, and then relax (R) as follows:

\[ \max_{\theta \in \Theta} g_\ell(\theta) < 1 \ \Rightarrow \ (w^\ell)^*(\lambda) = 0, \quad \ell = 1, \ldots, d. \tag{R*} \]

Therefore, to develop a screening rule for the MTFL model in (1), (R*) implies that: 1) we need
to estimate a region $\Theta$ containing $\theta^*(\lambda)$, which turns out to be a ball (please
refer to Section 4.2); 2) we need to solve the maximization problem on the left-hand side of (R*),
which turns out to be nonconvex (please refer to Section 4.3).

4.2 Estimation of the Dual Optimal Solution 

Based on the geometric properties of the dual problem (12), which is a projection problem, we first
derive the closed form solutions of the primal and dual problems for specific values of $\lambda$
in Section 4.2.1, and then give an accurate estimation of $\theta^*(\lambda)$ for the general cases
in Section 4.2.2.

4.2.1 Closed form solutions 

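The primal and dual optimal solutions $W^*(\lambda)$ and $\theta^*(\lambda)$ are generally unknown.
However, when the value of $\lambda$ is sufficiently large, we expect that $W^*(\lambda) = 0$, and
$\theta^*(\lambda) = y/\lambda$ by Eq. (14). The following theorem confirms this.

Theorem 1. For the MTFL model in (1), let

\[ \lambda_{\max} = \max_{\ell = 1, \ldots, d} \sqrt{\sum_{t=1}^{T} \langle x_\ell^{(t)}, y_t \rangle^2}. \tag{17} \]

Then, the following statements are equivalent:

\[ \frac{y}{\lambda} \in \mathcal{F} \ \Leftrightarrow \ \theta^*(\lambda) = \frac{y}{\lambda} \ \Leftrightarrow \ W^*(\lambda) = 0 \ \Leftrightarrow \ \lambda \ge \lambda_{\max}. \]

Remark 1. Theorem 1 indicates that both the primal and dual optimal solutions of the MTFL
model (1) admit closed form solutions for $\lambda \ge \lambda_{\max}$. Thus, we will focus on the
cases with $\lambda \in (0, \lambda_{\max})$ in the rest of this paper.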
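The threshold in Eq. (17) only needs one inner product per feature and task. The sketch below computes it in plain Python; the name `lambda_max` and the nested-list layout (`X[t]` is a list of rows of the $t$-th data matrix) are illustrative assumptions, and the toy numbers are not taken from the experiments.

```python
import math

def lambda_max(X, y):
    """Eq. (17): max over features of sqrt(sum_t <x_l^(t), y_t>^2)."""
    d = len(X[0][0])
    best = 0.0
    for j in range(d):
        total = 0.0
        for X_t, y_t in zip(X, y):
            # Inner product of the j-th column of X_t with y_t.
            ip = sum(row[j] * v for row, v in zip(X_t, y_t))
            total += ip * ip
        best = max(best, math.sqrt(total))
    return best

# Two tasks (2 samples and 1 sample), two features.
X = [[[1.0, 0.0], [0.0, 2.0]], [[3.0, 0.0]]]
y = [[1.0, 1.0], [1.0]]
lam_max = lambda_max(X, y)
# For any lam >= lam_max, Theorem 1 gives W*(lam) = 0 and theta*(lam) = y / lam.
theta_at_max = [[v / lam_max for v in y_t] for y_t in y]
```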
4.2.2 The general cases

Theorem 1 gives a closed form solution of $\theta^*(\lambda)$ for $\lambda \ge \lambda_{\max}$.
Therefore, we can estimate $\theta^*(\lambda)$ with $\lambda < \lambda_{\max}$ in terms of a known
$\theta^*(\lambda_0)$. Specifically, we can simply set $\lambda_0 = \lambda_{\max}$ and utilize
the result $\theta^*(\lambda_{\max}) = y/\lambda_{\max}$. To make this paper self-contained, we
first review some geometric properties of projection operators.

Theorem 2. (Ruszczyński, 2006) Let $C$ be a nonempty closed convex set. Then, for any point
$\bar{u}$, we have

\[ u = P_C(\bar{u}) \ \Leftrightarrow \ \bar{u} - u \in N_C(u), \]

where $N_C(u) = \{v : \langle v, u' - u \rangle \le 0, \ \forall\, u' \in C\}$ is called the normal
cone to $C$ at $u \in C$.

Another useful property of the projection operator in estimating $\theta^*(\lambda)$ is the
so-called firm nonexpansiveness.

Theorem 3. (Bauschke and Combettes, 2011) Let $C$ be a nonempty closed convex subset of a
Hilbert space $\mathcal{H}$. The projection operator with respect to $C$ is firmly nonexpansive,
namely, for any $u_1, u_2 \in \mathcal{H}$,

\[ \|P_C(u_1) - P_C(u_2)\|^2 + \|(I - P_C)(u_1) - (I - P_C)(u_2)\|^2 \le \|u_1 - u_2\|^2. \tag{18} \]

The firm nonexpansiveness of projection operators leads to the following useful result.

Corollary 4. Let $C$ be a nonempty closed convex subset of a Hilbert space $\mathcal{H}$ and
$0 \in C$. For any $u \in \mathcal{H}$, we have:

1. $\|P_C(u)\|^2 + \|u - P_C(u)\|^2 \le \|u\|^2$.

2. $\langle u, u - P_C(u) \rangle \ge 0$.

Remark 2. Part 1 of Corollary 4 indicates that if a closed convex set $C$ contains the origin, then,
for any point $u$, the norm of its projection with respect to $C$ is upper bounded by $\|u\|$.
The second part is a useful consequence of the first part and plays a crucial role in the estimation
of the dual optimal solution (see Theorem 5).
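Corollary 4 can be checked numerically on a simple set. The sketch below uses the closed unit ball, which is convex and contains the origin; this choice of set and the helper names are assumptions made for illustration (the dual feasible set of (12) is more complicated). The function returns the slack of each inequality, so both values should be nonnegative.

```python
import math

def _sq(v):
    """Squared Euclidean norm."""
    return sum(a * a for a in v)

def project_unit_ball(u):
    """Euclidean projection onto the closed unit ball."""
    norm = math.sqrt(_sq(u))
    if norm <= 1.0:
        return list(u)
    return [a / norm for a in u]

def corollary4_slacks(u):
    """Return (||u||^2 - ||P(u)||^2 - ||u - P(u)||^2, <u, u - P(u)>)."""
    p = project_unit_ball(u)
    residual = [a - b for a, b in zip(u, p)]
    part1 = _sq(u) - _sq(p) - _sq(residual)
    part2 = sum(a * b for a, b in zip(u, residual))
    return part1, part2

outside = corollary4_slacks([3.0, 4.0])   # point outside the ball
inside = corollary4_slacks([0.3, 0.4])    # point already in the ball
```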

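We are now ready to present an accurate estimation of the dual optimal solution $\theta^*(\lambda)$.

Theorem 5. For the MTFL model in (1), suppose that $\theta^*(\lambda_0)$ is known with
$\lambda_0 \in (0, \lambda_{\max}]$. Let $g_\ell$ be given by Eq. (16) for $\ell = 1, \ldots, d$, and

\[ \ell_* \in \left\{ \operatorname*{argmax}_{\ell = 1, \ldots, d} g_\ell(y) \right\}. \tag{19} \]

For any $\lambda \in (0, \lambda_0)$, we define

\[ n(\lambda_0) = \begin{cases} \dfrac{y}{\lambda_0} - \theta^*(\lambda_0), & \text{if } \lambda_0 \in (0, \lambda_{\max}), \\[2mm] \nabla g_{\ell_*}\!\left( \dfrac{y}{\lambda_{\max}} \right), & \text{if } \lambda_0 = \lambda_{\max}, \end{cases} \tag{20} \]

\[ r(\lambda, \lambda_0) = \frac{y}{\lambda} - \theta^*(\lambda_0), \tag{21} \]

\[ r^{\perp}(\lambda, \lambda_0) = r(\lambda, \lambda_0) - \frac{\langle n(\lambda_0), r(\lambda, \lambda_0) \rangle}{\|n(\lambda_0)\|^2}\, n(\lambda_0). \tag{22} \]

Then, the following holds:

1. $n(\lambda_0) \in N_{\mathcal{F}}(\theta^*(\lambda_0))$,

2. $\langle y, n(\lambda_0) \rangle \ge 0$,

3. $\langle r(\lambda, \lambda_0), n(\lambda_0) \rangle \ge 0$,

4. $\left\| \theta^*(\lambda) - \left( \theta^*(\lambda_0) + \tfrac{1}{2} r^{\perp}(\lambda, \lambda_0) \right) \right\| \le \tfrac{1}{2} \|r^{\perp}(\lambda, \lambda_0)\|$.

Consider Theorem 5. Part 1 characterizes $\theta^*(\lambda_0)$ via the normal cone. Parts 2 and 3
illustrate key geometric identities that lead to the accurate estimation of $\theta^*(\lambda)$ in
part 4 (see supplement for details).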

Remark 3. The estimations of the dual optimal solution in DPC and in EDPP (Wang et al.), which
is designed for Lasso, are both based on the geometric properties of the projection operators.
Thus, the formulas of the estimation in Theorem 5 are similar to those of EDPP. However, we note
that the estimations in DPC and EDPP are determined by the completely different geometric
structures of the corresponding dual feasible sets. Problem (12) implies that the dual feasible
set of the MTFL model (1) is much more complicated than that of Lasso, which is a polytope (the
intersection of a set of closed half spaces). Therefore, the estimation of the dual optimal
solution in DPC is much more challenging than that of EDPP, e.g., we need to find a vector in the
normal cone to the dual feasible set at $y/\lambda_{\max}$ [see $n(\lambda_{\max})$].

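For notational convenience, let

\[ o(\lambda, \lambda_0) = \theta^*(\lambda_0) + \tfrac{1}{2} r^{\perp}(\lambda, \lambda_0). \tag{23} \]

Theorem 5 implies that $\theta^*(\lambda)$ lies in the ball:

\[ \Theta(\lambda, \lambda_0) = \left\{ \theta : \|\theta - o(\lambda, \lambda_0)\| \le \tfrac{1}{2} \|r^{\perp}(\lambda, \lambda_0)\| \right\}. \tag{24} \]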
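The center and radius of the ball in Eq. (24) follow from Eqs. (21) to (23) with a few vector operations. The sketch below is a minimal illustration with hypothetical names (`dpc_ball`, `_dot`), using flat lists in which all tasks are concatenated. When `n` is omitted it uses $n(\lambda_0) = y/\lambda_0 - \theta^*(\lambda_0)$, which is only valid for $\lambda_0 < \lambda_{\max}$; for $\lambda_0 = \lambda_{\max}$ the caller is assumed to supply the normal vector. The numbers in the example are arbitrary and do not come from a solved MTFL instance.

```python
import math

def _dot(a, b):
    return sum(x * z for x, z in zip(a, b))

def dpc_ball(y, theta0, lam, lam0, n=None):
    """Center o(lam, lam0) and radius of the ball in Eq. (24).

    y, theta0 (= theta*(lam0)) and n are flat vectors of equal length.
    """
    if n is None:
        # Normal vector for lam0 < lam_max; nonzero because y/lam0 is infeasible.
        n = [yi / lam0 - ti for yi, ti in zip(y, theta0)]
    r = [yi / lam - ti for yi, ti in zip(y, theta0)]            # Eq. (21)
    coef = _dot(n, r) / _dot(n, n)
    r_perp = [ri - coef * ni for ri, ni in zip(r, n)]           # Eq. (22)
    center = [ti + 0.5 * pi for ti, pi in zip(theta0, r_perp)]  # Eq. (23)
    radius = 0.5 * math.sqrt(_dot(r_perp, r_perp))
    return center, radius

center, radius = dpc_ball(y=[2.0, 1.0], theta0=[1.0, 0.0], lam=0.5, lam0=1.0)
```

Projecting $r$ onto the orthogonal complement of $n(\lambda_0)$ can only shorten it, so the radius never exceeds $\tfrac{1}{2}\|r(\lambda, \lambda_0)\|$.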


4.3 Solving the Nonconvex Problem 


In this section, we solve the optimization problem in (R*) with $\Theta$ given by
$\Theta(\lambda, \lambda_0)$ [see Eq. (24)], namely,

\[ s_\ell(\lambda, \lambda_0) = \max_{\theta \in \Theta(\lambda, \lambda_0)} \left\{ g_\ell(\theta) = \sum_{t=1}^{T} \langle x_\ell^{(t)}, \theta_t \rangle^2 \right\}. \tag{25} \]

Although $g_\ell(\cdot)$ and $\Theta(\lambda, \lambda_0)$ are convex, problem (25) is nonconvex, as
it is a maximization problem. However, we can efficiently solve for the global optimal solutions
to (25) by transforming it to a QP1QC via a parametrization of the constraint set. We first cite
the following result.

Theorem 6. (Gay, 1981) Let $H$ be a symmetric matrix and $D$ be a positive definite matrix.
Consider

\[ \min_{\|Du\| \le \Delta} \psi(u) = \frac{1}{2} u^T H u + q^T u, \tag{26} \]

where $\Delta > 0$. Then, $u^*$ minimizes $\psi(u)$ over the constraint set if and only if there
exists $\alpha^* \ge 0$ (that is unique) such that $H + \alpha^* D^T D$ is positive semidefinite,

\[ (H + \alpha^* D^T D)\, u^* = -q, \tag{27} \]

\[ \|D u^*\| = \Delta, \quad \text{if } \alpha^* > 0. \tag{28} \]
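We are now ready to solve for $s_\ell(\lambda, \lambda_0)$.

Theorem 7. Let $o = o(\lambda, \lambda_0)$ and $u^*$ be the optimal solution of problem (26) with
$\Delta = \tfrac{1}{2}\|r^{\perp}(\lambda, \lambda_0)\|$, $D = I$,

\[ H = -\mathrm{diag}\left( 2\|x_\ell^{(1)}\|^2, \ldots, 2\|x_\ell^{(T)}\|^2 \right), \]

\[ q = -\left( 2\|x_\ell^{(1)}\|\, |\langle x_\ell^{(1)}, o_1 \rangle|, \ldots, 2\|x_\ell^{(T)}\|\, |\langle x_\ell^{(T)}, o_T \rangle| \right)^T, \]

namely, there exists an $\alpha^* \ge 0$ such that $\alpha^*$ and $u^*$ solve Eqs. (27) and (28). Let

\[ \rho_\ell = \max_{t = 1, \ldots, T} \|x_\ell^{(t)}\|, \quad \mathcal{I}_\ell = \left\{ t_* : \|x_\ell^{(t_*)}\| = \rho_\ell \right\}. \]

Then, the following hold:

1. $\alpha^*$ is unique, and $\alpha^* \ge 2\rho_\ell^2$.

2. We define $\bar{u} \in \mathbb{R}^T$ by

\[ \bar{u}_t = \begin{cases} -q_t / (h_{tt} + 2\rho_\ell^2), & \text{if } t \notin \mathcal{I}_\ell, \\ 0, & \text{otherwise}. \end{cases} \]

Then, we have

\[ \alpha^* \in \begin{cases} \{2\rho_\ell^2\}, & \text{if } \|\bar{u}\| \le \Delta \text{ and } \langle x_\ell^{(t_*)}, o_{t_*} \rangle = 0 \text{ for } t_* \in \mathcal{I}_\ell, \\ (2\rho_\ell^2, \infty), & \text{otherwise}. \end{cases} \]

3. Let $\mathcal{V} = \{v \in \mathbb{R}^T : v_t = 0 \text{ for } t \notin \mathcal{I}_\ell, \ \|\bar{u} + v\| = \Delta\}$. Then, we have

\[ u^* \in \begin{cases} \{\bar{u} + v : v \in \mathcal{V}\}, & \text{if } \alpha^* = 2\rho_\ell^2, \\ \{-(H + \alpha^* I)^{-1} q\}, & \text{otherwise}. \end{cases} \]

4. The maximum value of problem (25) is given by

\[ s_\ell(\lambda, \lambda_0) = \sum_{t=1}^{T} \langle x_\ell^{(t)}, o_t \rangle^2 + \frac{\alpha^*}{2} \Delta^2 - \frac{1}{2} q^T u^*. \]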

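Proof. We first transform problem (25) to a QP1QC by a parameterization of
$\Theta(\lambda, \lambda_0)$:

\[ \Theta(\lambda, \lambda_0) = \left\{ \begin{pmatrix} o_1 + u_1 \theta_1 \\ \vdots \\ o_T + u_T \theta_T \end{pmatrix} : \|u\| \le \Delta, \ \|\theta_t\| \le 1, \ t = 1, \ldots, T \right\}, \]

where $u = (u_1, \ldots, u_T)^T$. We define

\[ h_\ell(u, \theta) = g_\ell\left( \begin{pmatrix} o_1 + u_1 \theta_1 \\ \vdots \\ o_T + u_T \theta_T \end{pmatrix} \right). \]

Thus, problem (25) becomes

\[ s_\ell(\lambda, \lambda_0) = \max_{\|u\| \le \Delta} \left\{ \max_{\{\theta : \|\theta_t\| \le 1, \ t = 1, \ldots, T\}} h_\ell(u, \theta) \right\}. \]

By the Cauchy-Schwarz inequality, for a fixed $u$, we have

\[ \varphi(u) = \max_{\{\theta : \|\theta_t\| \le 1, \ t = 1, \ldots, T\}} h_\ell(u, \theta) = \sum_{t=1}^{T} \left( u_t^2 \|x_\ell^{(t)}\|^2 + 2 |u_t|\, \|x_\ell^{(t)}\|\, |\langle x_\ell^{(t)}, o_t \rangle| + \langle x_\ell^{(t)}, o_t \rangle^2 \right). \]

Let $-\psi(u) = \sum_{t=1}^{T} \left( u_t^2 \|x_\ell^{(t)}\|^2 + 2 u_t \|x_\ell^{(t)}\|\, |\langle x_\ell^{(t)}, o_t \rangle| \right)$.
We can see that

\[ \max_{\|u\| \le \Delta} \varphi(u) = \max_{\|u\| \le \Delta} -\psi(u) + \sum_{t=1}^{T} \langle x_\ell^{(t)}, o_t \rangle^2. \]

Thus, problem (25) becomes

\[ s_\ell(\lambda, \lambda_0) = -\min_{\|u\| \le \Delta} \psi(u) + \sum_{t=1}^{T} \langle x_\ell^{(t)}, o_t \rangle^2. \]

Therefore, to solve (25), it suffices to solve problem (26) with $\Delta$, $D$, $H$, and $q$ as in
the theorem. The statement follows immediately from Theorem 6. □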

Remark 4. To develop the DPC rule, (R*) implies that we only need the maximum value of 
problem (25). Thus, Theorem 6 does not show the global optimal solutions. However, in view of 
the proof, we can easily compute the global optimal solutions in terms of a* and u*. 

Computing $\alpha^*$ and $u^*$. Consider Theorem 7. If $\|\bar{u}\| \le \Delta$ and
$\langle x_\ell^{(t_*)}, o_{t_*} \rangle = 0$ for $t_* \in \mathcal{I}_\ell$, then $\alpha^*$ and
$u^*$ admit closed form solutions. Otherwise, $\alpha^*$ is strictly larger than $2\rho_\ell^2$,
which implies that $H + \alpha^* I$ is positive definite and invertible. If this is the case, we
apply Newton's method (Gay, 1981) to find $\alpha^*$ as follows. Let

\[ \phi(\alpha) = \|(H + \alpha I)^{-1} q\|^{-1} - \Delta^{-1}. \]

Because $\phi(\cdot)$ is strictly increasing on $(2\rho_\ell^2, \infty)$, $\alpha^*$ is the unique
root of $\phi(\cdot)$ on $(2\rho_\ell^2, \infty)$. Let $\alpha_0 = 2\rho_\ell^2$. Then, the $k$-th
iteration of Newton's method to solve $\phi(\alpha) = 0$ is:

\[ u_k = -(H + \alpha_{k-1} I)^{-1} q, \tag{29} \]

\[ \alpha_k = \alpha_{k-1} + \|u_k\|^2\, \frac{\|u_k\| - \Delta}{\Delta\, u_k^T (H + \alpha_{k-1} I)^{-1} u_k}. \tag{30} \]

As pointed out by Moré and Sorensen (1983), Newton's method is very efficient to find $\alpha^*$,
as $\phi(\alpha)$ is almost linear on $(2\rho_\ell^2, \infty)$. Our experiments indicate that five
iterations usually lead to an accuracy higher than $10^{-15}$.
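Because $H$ is diagonal and $D = I$, the quantities in Theorem 7 and the Newton iteration (29) and (30) reduce to elementwise arithmetic. The following sketch computes $s_\ell(\lambda, \lambda_0)$ from the per-task norms $a_t = \|x_\ell^{(t)}\|$, the magnitudes $b_t = |\langle x_\ell^{(t)}, o_t \rangle|$, and the radius $\Delta$. The function name `dpc_bound`, its arguments, the starting point slightly above $2\rho_\ell^2$, and the lower clamp on $\alpha$ are assumptions of this sketch, not details given in the text.

```python
import math

def dpc_bound(a, b, delta, max_iter=50, tol=1e-12):
    """Maximum of sum_t (a[t] * |u_t| + b[t])**2 over ||u|| <= delta."""
    base = sum(v * v for v in b)                  # sum_t <x_l^(t), o_t>^2
    rho2 = max(v * v for v in a)                  # rho_l squared
    if delta <= 0.0 or rho2 == 0.0:
        return base
    h = [-2.0 * v * v for v in a]                 # diagonal of H
    q = [-2.0 * v * w for v, w in zip(a, b)]
    top = [v * v == rho2 for v in a]              # membership in I_l
    lower = 2.0 * rho2

    # Closed-form case of Theorem 7: alpha* = 2 * rho_l^2.
    u_bar = [0.0 if is_top else -qt / (ht + lower)
             for qt, ht, is_top in zip(q, h, top)]
    if (math.sqrt(sum(v * v for v in u_bar)) <= delta
            and all(qt == 0.0 for qt, is_top in zip(q, top) if is_top)):
        q_dot_u = sum(qt * ut for qt, ut in zip(q, u_bar))
        return base + rho2 * delta ** 2 - 0.5 * q_dot_u

    # Newton iteration (29)-(30), started just inside (2 * rho_l^2, inf).
    alpha = lower * (1.0 + 1e-8)
    for _ in range(max_iter):
        u = [-qt / (ht + alpha) for qt, ht in zip(q, h)]
        norm_u = math.sqrt(sum(v * v for v in u))
        if abs(norm_u - delta) <= tol * delta:
            break
        curv = sum(v * v / (ht + alpha) for v, ht in zip(u, h))
        alpha += norm_u ** 2 * (norm_u - delta) / (delta * curv)
        alpha = max(alpha, lower * (1.0 + 1e-14))  # keep H + alpha*I positive definite
    u = [-qt / (ht + alpha) for qt, ht in zip(q, h)]
    q_dot_u = sum(qt * ut for qt, ut in zip(q, u))
    return base + 0.5 * alpha * delta ** 2 - 0.5 * q_dot_u

# Feature l is discarded by the screening test when the bound is below 1.
is_inactive = dpc_bound([0.2, 0.1], [0.3, 0.1], delta=0.5) < 1.0
```

The exact floating-point comparisons used to detect $\mathcal{I}_\ell$ and the closed-form case are adequate for a sketch; a production implementation would likely use tolerances instead.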
4.4 The Proposed DPC Rule

As implied by (R*), we present the proposed screening rule, DPC, for the MTFL model (1) in the
following theorem.

Theorem 8. For the MTFL model (1), suppose that $\theta^*(\lambda_0)$ is known with
$\lambda_0 \in (0, \lambda_{\max}]$. Then, we have

\[ s_\ell(\lambda, \lambda_0) < 1 \ \Rightarrow \ (w^\ell)^*(\lambda) = 0, \quad \forall\, \lambda \in (0, \lambda_0), \]

where $s_\ell(\lambda, \lambda_0)$ is given by Theorem 7.

In real applications, the optimal parameter value of $\lambda$ is generally unknown. Commonly used
approaches to determine an appropriate value of $\lambda$, such as cross validation and stability
selection, need to solve the MTFL model over a grid of tuning parameter values
$\lambda_1 > \lambda_2 > \ldots > \lambda_{\mathcal{K}}$, which is very time consuming. Inspired by
the ideas of Strong Rule (Tibshirani et al., 2012) and SAFE (El Ghaoui et al., 2012), we develop
the sequential version of DPC. Specifically, suppose that the optimal solution $W^*(\lambda_k)$ is
known. Then, we apply DPC to identify the inactive features of MTFL model (1) at $\lambda_{k+1}$
via $W^*(\lambda_k)$. We repeat this process until all $W^*(\lambda_k)$,
$k = 1, \ldots, \mathcal{K}$, are computed.

Corollary 9. DPC. For the MTFL model (1), suppose that we are given a sequence of parameter
values $\lambda_{\max} = \lambda_0 > \lambda_1 > \ldots > \lambda_{\mathcal{K}}$. Then, for any
$k = 1, 2, \ldots, \mathcal{K} - 1$, if $W^*(\lambda_k)$ is known, we have

\[ s_\ell(\lambda_{k+1}, \lambda_k) < 1 \ \Rightarrow \ (w^\ell)^*(\lambda_{k+1}) = 0, \]

where $s_\ell(\lambda, \lambda_0)$ is given by Theorem 7.

We omit the proof of Corollary 9 since it is a direct application of Theorem 8.
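In the sequential setting the known quantity is the primal solution at the previous grid point, while the ball estimate needs the dual point. The KKT condition $y_t = X_t w_t^*(\lambda) + \lambda \theta_t^*(\lambda)$ gives the conversion directly. The sketch below illustrates it with a hypothetical helper `dual_from_primal` and a nested-list layout; the result is only as accurate as the primal solution passed in.

```python
def dual_from_primal(X, y, W, lam):
    """theta_t = (y_t - X_t w_t) / lam for each task t (KKT stationarity).

    X[t] is a list of rows of the t-th data matrix, y[t] the response vector,
    and W[j][t] the weight of feature j in task t.
    """
    d = len(W)
    theta = []
    for t, (X_t, y_t) in enumerate(zip(X, y)):
        residual = [target - sum(row[j] * W[j][t] for j in range(d))
                    for row, target in zip(X_t, y_t)]
        theta.append([r / lam for r in residual])
    return theta

X = [[[1.0, 0.0]], [[0.0, 1.0]]]
y = [[1.0], [2.0]]
# With W = 0 the dual point is y / lam.
theta_zero = dual_from_primal(X, y, [[0.0, 0.0], [0.0, 0.0]], lam=2.0)
theta_fit = dual_from_primal(X, y, [[1.0, 0.0], [0.0, 1.0]], lam=1.0)
```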


5 Experiments 

We evaluate DPC on both synthetic and real data sets. To measure the performance of DPC, we
report the rejection ratio, namely, the ratio of the number of inactive features identified by DPC
to the actual number of inactive features. We also report the speedup, i.e., the ratio of the running
time of the solver without screening to the running time of the solver with DPC. The solver is from
the SLEP package (Liu et al., 2009c). For each data set, we solve the MTFL model in (1) along a
sequence of 100 tuning parameter values of $\lambda$ equally spaced on the logarithmic scale of
$\lambda/\lambda_{\max}$ from 1.0 to 0.01. We only evaluate DPC since no existing screening rule is
applicable to the MTFL model in (1).
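The evaluation protocol above can be written down in a few lines. The sketch below builds the logarithmically spaced grid and computes the two reported metrics; the function names and the convention for an empty set of inactive features are assumptions made for illustration.

```python
def lambda_grid(lam_max, num=100, lo=0.01):
    """num values of lam with lam / lam_max log-spaced from 1.0 down to lo."""
    return [lam_max * lo ** (k / (num - 1)) for k in range(num)]

def rejection_ratio(screened, truly_inactive):
    """Fraction of the truly inactive features that screening discarded."""
    truly_inactive = set(truly_inactive)
    if not truly_inactive:
        return 1.0  # convention chosen here: nothing to reject
    return len(set(screened) & truly_inactive) / len(truly_inactive)

def speedup(t_solver, t_screening_plus_solver):
    """Running time without screening divided by running time with screening."""
    return t_solver / t_screening_plus_solver

grid = lambda_grid(lam_max=2.0)
ratio = rejection_ratio(screened=[0, 2, 5], truly_inactive=[0, 1, 2, 5])
```

Since the rule is safe, every screened feature is truly inactive, so the rejection ratio never exceeds 1.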

5.1 Synthetic Studies 

We perform experiments on two synthetic data sets, called Synthetic 1 and Synthetic 2, that are
commonly used in the literature (Tibshirani et al., 2012, Zou and Hastie, 2005). Both Synthetic 1
and Synthetic 2 have 50 tasks. Each task contains 50 samples. For $t = 1, \ldots, 50$, the true
model is

\[ y_t = X_t w_t^* + 0.01\epsilon, \quad \epsilon \sim N(0, 1). \]

For Synthetic 1, the entries of each data matrix $X_t$ are i.i.d. standard Gaussian with pairwise
correlation zero, i.e., $\mathrm{corr}(x_i^{(t)}, x_j^{(t)}) = 0$. For Synthetic 2, the entries of
each data matrix $X_t$ are drawn from a standard Gaussian with pairwise correlation $0.5^{|i-j|}$,
i.e., $\mathrm{corr}(x_i^{(t)}, x_j^{(t)}) = 0.5^{|i-j|}$. To construct $w_t^*$, we first randomly
select 10% of the features. Then, the corresponding components of $w_t^*$ are populated from a
standard Gaussian, and the remaining ones are set to 0. For both Synthetic 1 and Synthetic 2, we
set the feature dimension to 10000, 20000, and 50000, respectively. For each setting, we run 20
trials and report the average performance in Fig. 1 and Table 1.




Figure 1: Rejection ratios of DPC on two synthetic data sets with different feature dimensions. 

Fig. 1 shows the rejection ratios of DPC on Synthetic 1 and Synthetic 2. For all the six 
settings, the rejection ratios of DPC are higher than 90%, even for small parameter values. This 
demonstrates one of the advantages of DPC, as previous empirical studies (El Ghaoui et al., 2012, 
Tibshirani et al., 2012, Wang et al.) indicate that the capability of screening rules in identifying 
inactive features usually decreases as the parameter value decreases. Moreover, Fig. 1 also shows
that as the feature dimension increases, the rejection ratios of DPC become higher and approach 1.
This suggests that DPC would be even more effective at identifying the inactive features on
higher-dimensional data sets.

Table 1 presents the running time of the solver with and without DPC. The speedup is very
significant, reaching about 58 times. Take Synthetic 1 for example. When the feature dimension
is 50000, the solver without DPC takes about 40.7 hours to solve problem (1) at 100 parameter
values. In contrast, combined with DPC, the solver takes less than one hour to solve the same
100 problems, which leads to a speedup of about 58 times. Table 1 also shows that the computational
cost of DPC is very low and is negligible compared to that of the solver without screening.
Moreover, as the rejection ratios of DPC increase with the feature dimension (see Fig. 1),
Table 1 shows that the speedup by DPC increases as well.


5.2 Experiments on Real Data Sets 

We perform experiments on three real data sets: 1) the TDT2 text data set (Cai et al., 2009);
2) the animal data set (Lampert et al., 2009); 3) the Alzheimer's Disease Neuroimaging Initiative
(ADNI) data set (http://adni.loni.usc.edu/).



Figure 2: Rejection ratios of DPC on three real data sets. 

The Animal Data Set. The data set consists of 30475 images of 50 animal classes. By
following the experiment settings in Kang et al. (2011), we choose 20 animal classes in the data
set: antelope, grizzly-bear, killer-whale, beaver, Dalmatian, Persian-cat, horse, german-shepherd,
blue-whale, Siamese-cat, skunk, ox, tiger, hippopotamus, leopard, moose, spider-monkey,
humpback-whale, elephant, and gorilla. We construct 20 tasks, where each of them is a classification
task of one type of animal against all the others. For the $t$-th task, we first randomly select 30
samples from the $t$-th class as the positive samples; and then we randomly select 30 samples from
all the other classes as the negative samples. We make use of all the seven sets of features kindly
provided by Lampert et al. (2009): color histogram features, local self-similarity features,
PyramidHOG (PHOG) features, SIFT features, colorSIFT features, SURF features, and DECAF features.
Thus, each image is represented by a 15036-dimensional vector. Hence, the data matrix $X_t$ of the
$t$-th task is of size 60 x 15036, where $t = 1, \ldots, 20$.

The TDT2 Data Set. The original data set contains 9394 documents of 30 categories. Each
document is represented by a 36771-dimensional vector. Similar to the Animal data set, we
construct 30 tasks, each of which is a classification task of one category against all the others
(Amit et al., 2007). Also, for the $t$-th task, we first randomly select 50 samples from the $t$-th
category as the positive samples, and then we randomly select 50 samples from all the other
categories as the negative samples. Moreover, we remove the features that have only zero entries,
thus leaving us 24262 features. Hence, the data matrix $X_t$ of the $t$-th task is of size
100 x 24262, where $t = 1, \ldots, 30$.

The ADNI Data Set The data set consists of 747 patients with 504095 single nucleotide 
polymorphisms (SNPs), and the volume of 93 brain regions for each patient. We first randomly 
select 20 brain regions. Then, for each region, we randomly select 50 patients, and utilize the 
corresponding SNPs data as the data matrix and the volumes of that brain region as the response. 
Thus, we have 20 tasks, each of which is a regression task. The data matrix $X_t$ of the $t$-th
task is of size 50 x 504095, where $t = 1, \ldots, 20$.
Table 1: Running time (in minutes) for solving the MTFL model (1) along a sequence of 100 tuning
parameter values of $\lambda$ equally spaced on the logarithmic scale of $\lambda/\lambda_{\max}$
from 1.0 to 0.01 by (a): the solver (Liu et al., 2009c) without screening (see the third column);
(b): the solver with DPC (see the fifth column).

Data set       d         solver     DPC      DPC+solver   speedup
Synthetic 1    10000     405.75     0.7      28.12        14.43
               20000     913.70     1.36     37.02        24.68
               50000     2441.57    3.50     42.08        58.03
Synthetic 2    10000     406.85     0.70     29.28        13.89
               20000     906.09     1.37     36.66        24.72
               50000     2435.38    3.46     44.78        54.39
Animal         15036     311.71     0.47     16.36        19.05
TDT2           24262     958.66     1.87     44.11        21.74
ADNI           504095    9625.58    21.13    35.34        272.37

Fig. 2 shows the rejection ratios of DPC on the aforementioned three real data sets, all of which
are above 90%. In particular, the rejection ratios of DPC on the ADNI data set are higher than 99%
at the 100 parameter values. Table 1 shows that the resulting speedup is very significant, reaching
about 270 times. We note that the feature dimension of the ADNI data set is more than half a
million. Without screening, Table 1 shows that the solver takes about seven days to compute the
MTFL model (1) at 100 parameter values. However, integrated with the
DPC screening rule, the solver computes the 100 solutions in about half an hour. The experiments
again indicate that DPC provides better performance (in terms of rejection ratios and speedup) for
higher dimensional data sets.


6 Conclusion 

In this paper, we propose a novel screening method for the MTFL model in (1), called DPC. The
DPC screening rule is based on an in-depth analysis of the geometric properties of the dual problem
and the dual feasible set. To the best of our knowledge, DPC is the first screening rule that is
applicable to sparse models with multiple data matrices. DPC is safe in the sense that the features
identified by DPC are guaranteed to have zero coefficients in the solution vectors across all tasks.
Experiments on synthetic and real data sets demonstrate that DPC is very effective in identifying
the inactive features, which leads to substantial savings in computational cost and memory usage
without sacrificing accuracy. Moreover, DPC is more effective as the feature dimension increases,
which makes DPC a very competitive candidate for the applications of very high-dimensional data.
We plan to extend DPC to more general MTFL models, e.g., the MTFL models with multiple
regularizers.
A Discussion of the Dual Problem of (1)

Although Eq. (9) implies that $\|m^\ell\| \le 1$, this might not be the case. Thus, we need to consider the
following two cases.

(i) If Eq. (9) holds, we can see that $\langle m^\ell, w^\ell \rangle = \|w^\ell\|$ and thus

\[ \min_{w^\ell} f_1^\ell(w^\ell) = 0. \tag{31} \]

Therefore, we have

\[ \min_{W} f_1(W) = 0. \tag{32} \]

(ii) If Eq. (9) does not hold, i.e., $\|m^\ell\| > 1$, we would have

\[ \inf_{w^\ell} f_1^\ell(w^\ell) = -\infty, \tag{33} \]

and thus

\[ \min_{W} f_1(W) = -\infty. \tag{34} \]

To see this, we define $w^\ell(t) = t\, m^\ell / \|m^\ell\|$ and thus

\[ \langle m^\ell, w^\ell(t) \rangle = t \|m^\ell\|. \tag{35} \]

Then, we have

\[ f_1^\ell(w^\ell(t)) = t\,(1 - \|m^\ell\|). \tag{36} \]

Because $\|m^\ell\| > 1$, the above equation yields

\[ \inf_{w^\ell} f_1^\ell(w^\ell) \le \lim_{t \to \infty} f_1^\ell(w^\ell(t)) = -\infty. \]

The above discussion implies that

\[ \min_{W} f_1(W) = \begin{cases} 0, & \text{if } \|m^\ell\| \le 1,\ \ell = 1, \ldots, d, \\ -\infty, & \text{otherwise.} \end{cases} \tag{37} \]
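The dichotomy in Eq. (37) is easy to observe numerically. The sketch below assumes that the per-row term has the form $\|w\| - \langle m, w \rangle$, which is consistent with Eq. (36); the names `f_row` and `ray` are hypothetical helpers introduced only for this illustration.

```python
import math


def norm(v):
    """Euclidean norm of a vector stored as a list."""
    return math.sqrt(sum(x * x for x in v))


def f_row(w, m):
    """Assumed per-row term ||w|| - <m, w> (consistent with Eq. (36))."""
    return norm(w) - sum(mi * wi for mi, wi in zip(m, w))


def ray(t, m):
    """The path w(t) = t * m / ||m|| used in case (ii)."""
    scale = t / norm(m)
    return [scale * mi for mi in m]


m_feasible = [0.3, -0.4]    # ||m|| = 0.5 <= 1: the term stays nonnegative
m_infeasible = [1.2, 1.6]   # ||m|| = 2.0 >  1: the term decreases like t * (1 - ||m||)

for t in (1.0, 10.0, 100.0):
    print(t,
          f_row(ray(t, m_feasible), m_feasible),
          f_row(ray(t, m_infeasible), m_infeasible))
```

Along the ray, the value equals $t(1 - \|m\|)$, so it grows without bound in the first case and decreases without bound in the second.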


B Proof of Theorem 1 

Proof. For notational convenience, we number the four statements as follows:

1. $y/\lambda \in \mathcal{F}$;

2. $\theta^*(\lambda) = y/\lambda$;

3. $W^*(\lambda) = 0$;

4. $\lambda \ge \lambda_{\max}$.

Eq. (13) implies that 1 is equivalent to 2.

($2 \Leftrightarrow 3$) Suppose that 2 holds. Eq. (14) implies that $X_t w_t^*(\lambda) = 0$ for $t = 1, \ldots, T$. Denote
the objective function of the MTFL model (1) by $f(W)$. We claim that $W^*(\lambda)$ must be zero. To
see this, suppose that $W'(\lambda) \neq 0$ is an optimal solution of (1), so that $X_t w_t'(\lambda) = 0$ for $t = 1, \ldots, T$.
The loss terms of $f$ then coincide at $0$ and at $W'(\lambda)$, whereas the regularization term is zero at $0$
and strictly positive at $W'(\lambda)$. Hence $f(0) < f(W'(\lambda))$, which contradicts the optimality of $W'(\lambda)$. Thus, the optimal
solution $W^*(\lambda)$ is zero and we have proved $2 \Rightarrow 3$. The converse direction, i.e., $2 \Leftarrow 3$, is a direct
consequence of Eq. (14).

($1 \Leftrightarrow 4$) It is evident that 1 holds if and only if $y/\lambda$ is a feasible solution of problem (12),
namely, all constraints in (12) hold at $y/\lambda$. By plugging $y/\lambda$ into the constraints in (12), we can
see that the feasibility of $y/\lambda$ is equivalent to 4. Thus, 1 is equivalent to 4. This
completes the proof. □

C Proof of Corollary 4 

Proof. 1. To show part 1, we only need to set $u_1 = u$ and $u_2 = 0$, and then plug them into
the inequality (18) [note that $P_C(0) = 0$ since $0 \in C$].

2. Part 1 implies that $\|P_C(u)\| \le \|u\|$. Thus, we have

\[ \|u\|^2 \ge \|u\| \, \|P_C(u)\| \ge \langle u, P_C(u) \rangle, \]

which is equivalent to the statement in part 2.

The proof is completed. □
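As a sanity check, both parts of the corollary can be verified numerically on a simple set. The sketch below uses the box $[-1, 1]^n$ as a stand-in for $C$ (an assumption made for illustration; it is closed, convex and contains $0$) and reads part 1 as the inequality obtained from (18) with $u_1 = u$ and $u_2 = 0$, namely $\|P_C(u)\|^2 + \|u - P_C(u)\|^2 \le \|u\|^2$. The function names are hypothetical.

```python
import random


def project_box(u, radius=1.0):
    """Euclidean projection onto the box [-radius, radius]^n, which contains 0."""
    return [max(-radius, min(radius, x)) for x in u]


def dot(a, b):
    """Inner product of two vectors stored as lists."""
    return sum(x * y for x, y in zip(a, b))


def check_corollary4(u, tol=1e-12):
    """Return (part1, part2) for the point u, with C replaced by the unit box."""
    p = project_box(u)
    residual = [x - y for x, y in zip(u, p)]
    # Part 1 (as read from (18) with u_1 = u, u_2 = 0).
    part1 = dot(p, p) + dot(residual, residual) <= dot(u, u) + tol
    # Part 2: <u, u - P_C(u)> >= 0, equivalently ||u||^2 >= <u, P_C(u)>.
    part2 = dot(u, residual) >= -tol
    return part1, part2


rng = random.Random(0)
samples = [[rng.uniform(-3.0, 3.0) for _ in range(5)] for _ in range(1000)]
all_hold = all(all(check_corollary4(u)) for u in samples)
print(all_hold)
```

The check uses only the fact that the box contains the origin; any other closed convex set with $0 \in C$ and a computable projection could be substituted.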


D Proof of Theorem 5 

We first cite some useful properties of the projection operators. 

Lemma 10. (Ruszczyński, 2006, Bauschke and Combettes, 2011) Let $C$ be a nonempty closed
convex set of a Hilbert space and $u \in C$. Then

1. $N_C(u) = \{ v : P_C(u + v) = u \}$.

2. $P_C(u + v) = u$, $\forall\, v \in N_C(u)$.

3. Let $\bar{u} \notin C$ and $u = P_C(\bar{u})$. Then, $P_C(u + t(\bar{u} - u)) = u$ for all $t \ge 0$.
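Part 3 of the lemma states that every point on the ray from $u = P_C(\bar{u})$ through $\bar{u}$ projects back onto $u$. The following sketch illustrates this with the box $[-1, 1]^n$ in place of $C$, which is an assumption made purely for illustration; `ray_projects_back` is a hypothetical helper name.

```python
def project_box(u, radius=1.0):
    """Euclidean projection onto the box [-radius, radius]^n."""
    return [max(-radius, min(radius, x)) for x in u]


def ray_projects_back(u_bar, ts=(0.0, 0.5, 1.0, 10.0, 1000.0), tol=1e-12):
    """Check P_C(u + t (u_bar - u)) = u for u = P_C(u_bar) and several t >= 0."""
    u = project_box(u_bar)
    for t in ts:
        point = [ui + t * (bi - ui) for ui, bi in zip(u, u_bar)]
        projected = project_box(point)
        if max(abs(a - b) for a, b in zip(projected, u)) > tol:
            return False
    return True


u_bar = [2.0, -0.3, -5.0]          # a point outside the box
print(project_box(u_bar))           # its projection u
print(ray_projects_back(u_bar))     # every point on the ray projects onto u
```

This is the property used later in the proof to conclude that $\theta^*(\lambda_0) + t\, n(\lambda_0)$ projects onto $\theta^*(\lambda_0)$ for all $t \ge 0$.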

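We are now ready to prove Theorem 5.

Proof.

(i) For $\lambda \in (0, \lambda_{\max})$, Theorem 1 implies that $y/\lambda \notin \mathcal{F}$. Thus, the statement holds for
$\lambda \in (0, \lambda_{\max})$ by Theorem 2 and Eq. (13) [let $\bar{u} = y/\lambda$ and $u = \theta^*(\lambda)$].

To show the statement holds at $\lambda_{\max}$, Theorem 2 indicates that we need to show

\[ \left\langle \nabla g_{\ell_*}\!\left( \frac{y}{\lambda_{\max}} \right),\ \theta - \frac{y}{\lambda_{\max}} \right\rangle \le 0, \quad \forall\, \theta \in \mathcal{F}. \tag{38} \]

Because $g_{\ell_*}(\cdot)$ is convex, we have (Ruszczyński, 2006)

\[ g_{\ell_*}(\theta) - g_{\ell_*}\!\left( \frac{y}{\lambda_{\max}} \right) \ge \left\langle \nabla g_{\ell_*}\!\left( \frac{y}{\lambda_{\max}} \right),\ \theta - \frac{y}{\lambda_{\max}} \right\rangle. \tag{39} \]

Note that $g_\ell$ is the constraint function of the dual problem in (12). Thus, for any dual
feasible solution $\theta \in \mathcal{F}$, it is evident that $g_{\ell_*}(\theta) \le 1$. Moreover, Eq. (17) implies that
$g_{\ell_*}(y/\lambda_{\max}) = 1$. Therefore, the left-hand side of the inequality (39) must be non-positive,
which yields inequality (38). Thus, the statement holds.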

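(ii) A direct application of part 2 of Corollary 4 yields

\[ \left\langle \frac{y}{\lambda_0},\ n(\lambda_0) \right\rangle = \left\langle \frac{y}{\lambda_0},\ \frac{y}{\lambda_0} - \theta^*(\lambda_0) \right\rangle \ge 0, \quad \forall\, \lambda_0 \in (0, \lambda_{\max}), \]

so that $\langle y, n(\lambda_0) \rangle \ge 0$. When $\lambda_0 = \lambda_{\max}$, by noting that $n(\lambda_{\max}) = \nabla g_{\ell_*}(y/\lambda_{\max})$, we have

\[ \langle y, n(\lambda_{\max}) \rangle = \left\langle y,\ \nabla g_{\ell_*}\!\left( \frac{y}{\lambda_{\max}} \right) \right\rangle \ge 0, \]

where the inequality follows, for instance, by taking $\theta = 0 \in \mathcal{F}$ in (38).

Thus, the statement holds.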

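(iii) By Eq. (21), we have

\[ \langle r(\lambda, \lambda_0), n(\lambda_0) \rangle = \left( \frac{1}{\lambda} - \frac{1}{\lambda_0} \right) \langle y, n(\lambda_0) \rangle + \left\langle \frac{y}{\lambda_0} - \theta^*(\lambda_0),\ n(\lambda_0) \right\rangle. \tag{40} \]

By Eqs. (20) and (13), the second term on the right-hand side of Eq. (40) is nonnegative
for all $\lambda_0 \in (0, \lambda_{\max}]$.

For the first term, note that $1/\lambda - 1/\lambda_0 > 0$ because $\lambda < \lambda_0$. The fact that $0 \in \mathcal{F}$, together with inequality (38), yields

\[ \left\langle 0 - \frac{y}{\lambda_{\max}},\ n(\lambda_{\max}) \right\rangle \le 0. \]

Thus, the first term on the right-hand side of Eq. (40) is nonnegative for $\lambda_0 = \lambda_{\max}$. For
$\lambda_0 \in (0, \lambda_{\max})$, part 2 of Corollary 4, Eqs. (13) and (20) imply that

\[ \left\langle \frac{y}{\lambda_0},\ \frac{y}{\lambda_0} - \theta^*(\lambda_0) \right\rangle \ge 0. \]

Thus, the first term on the right-hand side of Eq. (40) is nonnegative for $\lambda_0 \in (0, \lambda_{\max})$.
As a result, the inner product $\langle r(\lambda, \lambda_0), n(\lambda_0) \rangle$ is nonnegative.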

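(iv) We define

\[ \theta(t) = \theta^*(\lambda_0) + t\, n(\lambda_0). \tag{41} \]

Part 1 of Lemma 10 implies that

\[ P_{\mathcal{F}}(\theta(t)) = \theta^*(\lambda_0), \quad \forall\, t \ge 0. \tag{42} \]

The nonexpansiveness of the projection operators yields [let $u_1 = y/\lambda$ and $u_2 = \theta(t)$ and
plug them into (18)]

\[ \left\| P_{\mathcal{F}}\!\left( \frac{y}{\lambda} \right) - P_{\mathcal{F}}(\theta(t)) \right\|^2 + \left\| (P_{\mathcal{F}} - \mathrm{Id})\!\left( \frac{y}{\lambda} \right) - (P_{\mathcal{F}} - \mathrm{Id})(\theta(t)) \right\|^2 \le \left\| \frac{y}{\lambda} - \theta(t) \right\|^2, \quad \forall\, t \ge 0. \]

By Eqs. (13), (42) and (21), the above inequality reduces to

\[ \| \theta^*(\lambda) - \theta^*(\lambda_0) \|^2 + \| \theta^*(\lambda) - \theta^*(\lambda_0) - ( r(\lambda, \lambda_0) - t\, n(\lambda_0) ) \|^2 \le \| r(\lambda, \lambda_0) - t\, n(\lambda_0) \|^2, \quad \forall\, t \ge 0. \tag{43} \]

Let us consider

\[ \min_{t \ge 0}\ r(t) = \| r(\lambda, \lambda_0) - t\, n(\lambda_0) \|^2. \tag{44} \]

Because $r(t)$ is a quadratic function of $t$, we can see that

\[ \min_{t \ge 0} r(t) = \begin{cases} \| r(\lambda, \lambda_0) \|^2, & \text{if } \langle r(\lambda, \lambda_0), n(\lambda_0) \rangle < 0, \\ \| r^{\perp}(\lambda, \lambda_0) \|^2, & \text{if } \langle r(\lambda, \lambda_0), n(\lambda_0) \rangle \ge 0. \end{cases} \]

Because of part 3, we have

\[ \min_{t \ge 0} r(t) = \| r^{\perp}(\lambda, \lambda_0) \|^2, \tag{45} \]

\[ \operatorname*{argmin}_{t \ge 0}\ r(t) = \frac{\langle r(\lambda, \lambda_0), n(\lambda_0) \rangle}{\| n(\lambda_0) \|^2}. \tag{46} \]

Plugging Eqs. (45) and (46) into (43) yields the statement, which completes the proof. □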
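The argument in step (iv) can be checked numerically. Substituting the minimizer (46) into (43) and completing the square gives $\| \theta^*(\lambda) - (\theta^*(\lambda_0) + \tfrac{1}{2} r^{\perp}(\lambda, \lambda_0)) \| \le \tfrac{1}{2} \| r^{\perp}(\lambda, \lambda_0) \|$, which is how we read the resulting estimate here. The sketch below tests this inequality with the dual feasible set replaced by the box $[-1, 1]^n$ and $\theta^*(\lambda)$ taken to be the projection of $y/\lambda$ onto that box; both are simplifying assumptions rather than the actual set $\mathcal{F}$ of problem (12), and all function names are hypothetical.

```python
import math
import random


def project_box(u, radius=1.0):
    """Euclidean projection onto the box [-radius, radius]^n (stand-in for F)."""
    return [max(-radius, min(radius, x)) for x in u]


def dot(a, b):
    """Inner product of two vectors stored as lists."""
    return sum(x * y for x, y in zip(a, b))


def ball_estimate(y, lam, lam0):
    """Center and radius of the ball obtained from (43), (45) and (46)."""
    theta0 = project_box([v / lam0 for v in y])        # theta*(lam0) = P(y / lam0)
    n = [v / lam0 - p for v, p in zip(y, theta0)]      # n(lam0) = y / lam0 - theta*(lam0)
    r = [v / lam - p for v, p in zip(y, theta0)]       # r(lam, lam0) = y / lam - theta*(lam0)
    nn = dot(n, n)
    # Minimizer of ||r - t n||^2 over t >= 0; t = 0 if n vanishes.
    t_star = max(0.0, dot(r, n) / nn) if nn > 0.0 else 0.0
    r_perp = [ri - t_star * ni for ri, ni in zip(r, n)]
    center = [p + 0.5 * q for p, q in zip(theta0, r_perp)]
    radius = 0.5 * math.sqrt(dot(r_perp, r_perp))
    return center, radius


def estimate_holds(y, lam, lam0, tol=1e-9):
    """Check that theta*(lam) = P(y / lam) lies in the estimated ball."""
    theta = project_box([v / lam for v in y])
    center, radius = ball_estimate(y, lam, lam0)
    gap = [a - b for a, b in zip(theta, center)]
    return math.sqrt(dot(gap, gap)) <= radius + tol


rng = random.Random(0)
trials = []
for _ in range(1000):
    y = [rng.uniform(-5.0, 5.0) for _ in range(6)]
    lam0 = rng.uniform(0.2, 2.0)
    lam = lam0 * rng.uniform(0.05, 0.99)               # lam < lam0
    trials.append(estimate_holds(y, lam, lam0))
print(all(trials))
```

The clamp `max(0.0, ...)` covers both cases of the quadratic minimization over $t \ge 0$; under part 3 the inner product is nonnegative and the clamp is inactive.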